FlexDM: Enabling robust and reliable parallel data mining using WEKA

Authors

  • Madison Flannery
  • David Budden
  • Alexandre Mendes
Abstract

Performing massive data mining experiments with multiple datasets and methods is a common task faced by most bioinformatics and computational biology laboratories. WEKA is a machine learning package designed to facilitate this task by providing tools that allow researchers to select from several classification methods and specific test strategies. Despite its popularity, the current WEKA environment for batch experiments, namely Experimenter, has four limitations that impact its usability: the selection of value ranges for method options lacks flexibility and is not intuitive; there is no support for parallelisation when running large-scale data mining tasks; the XML schema is difficult to read, necessitating the use of the Experimenter’s graphical user interface for generation and modification; and robustness is limited by the fact that results are not saved until the last test has concluded. FlexDM implements an interface to WEKA to run batch processing tasks in a simple and intuitive way. In a short and easy-to-understand XML file, one can define hundreds of tests to be performed on several datasets. FlexDM also allows those tests to be executed asynchronously in parallel to take advantage of multi-core processors, significantly increasing usability and productivity. Results are saved incrementally for better robustness and reliability. FlexDM is implemented in Java and runs on Windows, Linux and OS X. As we encourage other researchers to explore and adopt our software, FlexDM is made available as a pre-configured bootable reference environment. All code, supporting documentation and usage examples are also available for download at http://sourceforge.net/projects/flexdm.
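The two ideas the abstract emphasises, running tests asynchronously in parallel and saving results incrementally, can be illustrated with a minimal Java sketch. This is not FlexDM's code and does not call WEKA: `ParallelBatchSketch`, `runExperiment` and `runAll` are hypothetical names, the dataset and classifier strings are only labels, and the "score" is a placeholder computed from a hash. The sketch assumes one task per dataset/classifier pair and one output line per finished task.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelBatchSketch {

    /** Hypothetical stand-in for one test; a real task would train and evaluate a classifier. */
    static String runExperiment(String dataset, String classifier) {
        // Placeholder score in [0, 99], derived from the labels only.
        int score = Math.floorMod((dataset + classifier).hashCode(), 100);
        return dataset + "," + classifier + "," + score;
    }

    /** Runs every dataset/classifier pair on a fixed pool and appends each result as it finishes. */
    static int runAll(List<String> datasets, List<String> classifiers, int threads, Path out)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        ExecutorCompletionService<String> done = new ExecutorCompletionService<>(pool);
        int submitted = 0;
        for (String d : datasets) {
            for (String c : classifiers) {
                done.submit(() -> runExperiment(d, c));
                submitted++;
            }
        }
        int written = 0;
        try (BufferedWriter w = Files.newBufferedWriter(
                out, StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (int i = 0; i < submitted; i++) {
                // take() blocks until the next task completes, in completion order.
                String line = done.take().get();
                w.write(line);
                w.newLine();
                // Flush after every result so finished work survives a later failure.
                w.flush();
                written++;
            }
        } finally {
            pool.shutdown();
        }
        return written;
    }

    public static void main(String[] args) throws Exception {
        Path out = Files.createTempFile("results", ".csv");
        int n = runAll(List.of("iris", "glass"), List.of("J48", "NaiveBayes", "SMO"), 4, out);
        System.out.println(n + " results written to " + out);
    }
}
```

Collecting results through a completion service means a slow test does not delay the saving of faster ones, which is the property the abstract contrasts with saving everything only after the last test has concluded.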

Similar resources

Progress Report on “ Big Data Mining ”

Big Data consists of voluminous, high-velocity and high-variety datasets that are increasingly difficult to process using traditional methods. Data Mining is the process of discovering knowledge by analysing raw datasets. Traditional Data Mining tools, such as Weka and R, have been designed for single-node sequential execution and fail to cope with modern Big Data volumes. In contrast, distribu...

Performance Improvement of Data Mining in Weka through GPU Acceleration

Data mining tools may be computationally demanding, so there is an increasing interest in parallel computing strategies to improve their performance. The popularization of Graphics Processing Units (GPUs) increased the computing power of current desktop computers, but desktop-based data mining tools do not usually take full advantage of these architectures. This paper exploits an approach to im...

Inhambu: Data Mining Using Idle Cycles in Clusters of PCs

In this paper we present and evaluate Inhambu, a distributed object-oriented system that relies on dynamic monitoring to collect information about the availability of computational resources, providing the necessary support for the execution of data mining applications on clusters of PCs and workstations. We also describe a modified implementation of the data mining tool Weka, which executes the...

Analysis and Design of Service-Oriented Framework for Executing Data Mining Services on Grids

Data mining services on grids are a present-day necessity. Workflow environments are widely used in data mining systems to manage the data and execution flows associated with complex applications. Weka, one of the most used open-source data mining systems, includes the Knowledge-Flow environment, which provides a drag-and-drop interface to compose and execute data mining workflows. It allows users to ...

Comparison of Different Classification Techniques Using WEKA for Hematological Data

ABSTRACT: Medical professionals need a reliable prediction methodology to diagnose hematological data comments. There are large quantities of information about patients and their medical conditions. Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information. Data mining software is...

Journal:
  • CoRR

Volume abs/1412.5720

Publication date: 2014